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Abstract. Laplacian matrices of graphs arise in large-scale computational applications such as 
machine learning; spectral clustering of images, genetic data and web pages; transportation network 
flows; electrical resistor circuits; and elliptic partial differential equations discretized on unstructured 
grids with finite elements. A Lean Algebraic Multigrid (LAMG) solver of the linear system Ax = 6 is 
presented, where A is a graph Laplacian. LAMG's run time and storage are linear in the number of 
graph edges. LAMG consists of a setup phase, in which a sequence of increasingly- coarser Laplacian 
systems is constructed, and an iterative solve phase using multigrid cycles. General graphs pose 
algorithmic challenges not encountered in traditional applications of algebraic multigrid. LAMG 
combines a lean piecewise-constant interpolation, judicious node aggregation based on a new node 
proximity definition, and an energy correction of the coarse-level systems. This results in fast conver- 
gence and substantial overhead and memory savings. A serial LAMG implementation scaled linearly 
for a diverse set of 1666 real-world graphs with up to six million edges. This multilevel methodology 
can be fully parallelized and extended to eigenvalue problems and other graph computations. 
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1. Introduction. Let G — {Af, £, w) be a weighted undirected graph, where J\f 
is a set of n nodes, £ is a set of m edges, and w : £ ^ is a weight function. 

The Laplacian matrix A„xn is naturaUy defined in terms of its quadratic energy, 
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where T denotes the transpose operator. In matrix form, 
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Au := {v : {u,v) (E £} . (1.2) 
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Fig. 1.1. A 5-node graph with a negative weight. The Laplacian is still semi-positive definite here. 
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A is zero row-sum, Symmetric Positive Semi-definite (SPS), and has 2m + 1 non- 
zeros. Typically, m <^v? and A is sparse. Unlike graph-theoretic works [HI 1311 EZ] , 
w is not assumed to be strictly positive: we also include SPS matrices with positive 
off-diagonal entries such as high-order and anisotropic grid discretizations [10 . 

A is singular: if G has M connected components whose node sets are TVi, ...,Nm, 
then A has an eigenvalue A = of multiplicity M and eigenvectors Ui, . . . , vlm, where 
Ui is the characteristic function of Mi- In a singly connected graph, the null space is 
spanned by the vector of ones, ui =: u. We consider the compatible linear system 



where b, a G are given s.t. U'^b = 0, and x e is the vector of unknowns. 
(|1.3|) has a unique solution [SUl pp. 185-186]. Our goal is to develop a numerical 
iterative solver of (|1.3|) that requires 0{m) storage and 0(mlog(l/e)) operations to 
generate an e-accurate solution, for graphs arising in real-world applications. The 
hidden constants should be small (in the hundreds, not millions). The solver should 
be parallelizable, and require a smaller cost to re-solve the system for multiple b's - 
a useful feature for time-dependent and other applications. 

1.1. Applications. The linear system ()1.3|) is fundamental to many applica- 
tions, briefly sketched below. See Spielman's review [46, §2] for more details. 

Elliptic Partial Differential Equations (PDEs) discretized on unstructured grids 
by the Finite Element Method (FEM) 4,, typically solved at each time step of a 
time-dependent Computational Fluid Dynamics (CFD) simulation [26]. x represents 
pressure, w is the material's diffusion coefficient, and b is a forcing term. 

Network flows. The Maximum Flow and Minimum Cost Flow problems are linear 
programming problems that, if solved by interior point algorithms, reduce to solving 
a sequence of restricted Laplacian systems [TH [17] . 

Electrical networks. In the study of electrical flow through a resistor network G, 
nodes are electrical components, x is the electrical potential, Wuv is the conductance 
between u and v, and b is a vector of external currents. 

A stepping stone toward the eigenproblem. Our multilevel methodology is appli- 
cable to finding the smallest eigenpairs of A with minor adaptations (cf. i i5.4p . The 
Laplacian eigenproblem is central to graph regression and classification in machine 
learning [171 [53] ; spectral clustering of images and graph embedding [44] ; dimension 
reduction for genetic ancestry discovery [35] ; and spectral graph theory. Of particular 
interest is the Fiedler value - the smallest non-zero eigenvalue and the corresponding 
Fiedler eigenvector, which measure the algebraic connectivity of G [18, §1.1] and re- 
lated to minimum cuts Although we believe it preferable to develop multiscale 
strategies for the original formulations of these graph problems, as demonstrated by 
the work [ID] (more examples are discussed in the papers [301 1331 ES] ) : a fast black-box 
eigensolver is a practical alternative. 

1.2. Related Work. There are two main approaches to solving (|1.3p : direct, 
leading to an exact solution; and iterative, which produces successive approximations 
X to x and typically requires 0(log(l/£)) iterations to achieve e-accuracy, namely. 



U^x = a , 



Ax b 



U := (ui I • ■ • I Um) , 



(1.3a) 
(1.3b) 



||x - x||a < e||x||A , 



||x||a:=v^- 



(1.4) 
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1.2.1. Direct Methods. The Cholesky factorization with a clever ehmination 
order can be apphed to A. A permutation matrix P is chosen and factorization 
P"^ AP = LL"^ constructed so that the lower triangular L is as sparse as possible, using 
Minimum or Approximate Minimum Degree Ordering |49[ [1] methods. Except for 
simple graphs, direct algorithms do not scale, requiring 0{n}'^) operations for planar 
graphs and 0{n^) in general. Alternatively, fast matrix inversion can be performed 
in 0(ri^-^^^) or combined with Cholesky, yet yields similar complexities 46, §3.1]. 

1.2.2. Iterative Methods: Graph Theoretic. These are variants of the Pre- 
conditioned Conjugate (PCG) method [Ml §10.3] that achieve (HH) in k{AB~'^) 
log(l/£)) iterations for a preconditioner B; k is the finite condition number |46[ §3.3]. 

Recent works |32[I47! have been focusing on multilevel graph-sparsifying precondi- 
tioners. The graph is repeatedly partitioned into sets of high-conductance nodes with- 
out removing too many edges. The complexity is a near-linear 0(mlog^ rilog(l/e)), 
guaranteed for any symmetric diagonally-dominant A. Unfortunately, no implemen- 
tation are available yet, nor is there a guarantee on the size of the hidden constant. 

1.2.3. Iterative Methods: AMG. Algebraic Muhigrid (AMG) is a class of 
high-performance linear solvers, originating in the early 1980s [T^ §1-1], [S]) III] and 
still under active development. During a setup phase, AMG recursively automati- 
cally constructs a multi-level hierarchy of increasingly coarser graphs by examining 
matrix entries, without relying on geometric information. The solve phase consists of 
standard multigrid cycles. AMG can be employed either as a stand-alone solver or 
as a PCG preconditioner [SUl App. A]. Open-source parallel implementations such as 
Hypre [15] and Trilinos-PETSc [3T] are available. In classical AMG, the coarse set is 
a subset of TV; popular alternatives are aggregation AMG [50l App. A.9],|6],[3] and 
smoothed aggregation [16], where the coarse nodes are aggregates of fine nodes. 

AMG mainly targets discretized elliptic PDEs on unstructured grids [26]. Re- 
cent works have been focusing on improving the coarsening and interpolation to in- 
crease the solution efficiency. These include Bootstrap AMG [51 §17.2], [TT] adaptive 
smoothed aggregation [15) and interpolation energy minimization schemes such as Ol- 
son and his associates' [39]. While these methods serve to increase AMG's scope and 
approach linear scaling for more systems, they are not designed for general graphs, as 
will be explained in !j2] The present work aims at generalizing AMG to graph Lapla- 
cians and addresses peculiarities not encountered in traditional AMG applications. 

1.3. Our Contribution. We present Lean Algebraic Multigrid (LAMG): a 
practical graph Laplacian solver. LAMG attains optimal efficiency: its setup phase 
requires 0(m) time and storage, and solve phase requires 0(m log(l/e)) operations 
per Right-Hand Side (RHS). An unoptimized Matlab LAMG implementation scaled 
linearly for 1666 real- world graphs with up to six million edges, ranging from com- 
putational fluid dynamics to social networks. While we do not prove nor claim that 
LAMG works for every graph, these results will hopefully support its practical use. 

LAMG is an aggregation- AMG algorithm [SO] App. A. 9] composed of lean com- 
ponents that significantly decrease setup time and memory usage and boost the solve 
phase efficiency. The key design decision is the choice of a caliber- 1 (piecewise- 
constant) interpolation between levels. Fast asymptotic convergence is achieved by 
(a) a new relaxation-based node proximity definition, which guides the aggregation and 
improves upon the algebraic distance defined by Ron et al. (40) : and (b) an energy 
correction applied to coarse-level systems. We offer two alternatives: a fiat correc- 
tion similar to Braess' work 6 , yet resulting in a superior efficiency; or an adaptive 
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correction via multilevel iterant recombination [501 §7.8.2] that is even more efficient. 

Importantly, the developed multilevel methodology is extensible to eigenvalue 
problems, other linear systems and other graph computational problems. 

1.4. Paper Organization. We begin by elucidating the general issues of devel- 
oping AMG for the graph Laplacian in ij2l The LAMG algorithm is explained in ^Sl 
comprising of a setup phase ( N3.ip and a solve phase ( ij3.6l) . Numerical results are 
presented in 2] Future enhancements and extensions are outlined in fj5] 

2. General Considerations. It is important to first understand the pitfalls of 
existing AMG algorithms in general graphs and their remedies in LAMG. 

2.1. Node Proximity. The construction of an effective coarse node set hinges 
upon defining which nodes in TV are "proximal", i.e., nodes whose values are strongly 
coupled in all smooth (low-energy) vectors [SUJ p. 473]. Table [^?T] lists three definitions. 



Classical AMG 


\wuv\/ maxjmaxs u;„s ,maxs \wsv\} 


1/ Algebraic Distance 


1/ (^max^;^ x^u'' - x^v^ j 


Affinity 


Cuv / max {maxs^^u c^s, niaxs;^^ Cst,} , 

Cuv '■= 1 {(XuT^u) (A't,,^!,) ^ , 



Table 2.1 

Comparison of node proximity measures. Nodes are defined as "close" when the measure exceeds 
a certain threshold. 




Classical AMG defines proximity based on matrix entries fTable I^TTl top row). 
While this has worked well for coarsening discretized scalar elliptic PDEs, it leads 
to wrong aggregation decisions in non-local graphs. In a grid graph with an extra 
link between distant nodes u, v (Fig. 12.1b ). u and v become proximal and may be 
aggregated. Unless Wuv is outstandingly large, this is undesirable because u and v 
belong to unrelated milieus of the grid. 

This problem is overcome by the algebraic distance measure introduced by Ron et 
al. [40] (Table [231 middle row; a related definition is used in the work [lO]). A set of 
K relaxed Test Vectors (TVs) x'^^^ . . . ,x'^^^ is generated, each obtained by applying 
several Gauss-Seidel (GS) relaxation sweeps to Ax — 0, starting from randoni[— 1, 1] 
and normalizing the result. Yet this definition falls pray to a graph containing two 
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connected suns (high-degree nodes) u and v, each of which connecting many satelhtes 
(Fig. l2.1b V For each fc, the value x^u ^ is an average over a large neighborhood of 
nodes whose size increases with the number of sweeps, and will be small. Similarly, 

(k) 

Xv is small, so every pair of suns is always proximal even though they may represent 
distant node clusters. 

LAMG's proximity measure is the affinity f Table \2A\ last row), which also relies 
on the same TVs, yet is scale-invariant and correctly assesses both cases of Fig. 12.11 
as well as many other constellations. 

2.2. Interpolation Caliber. Textbook multigrid convergence for the Poisson 
equation requires that the interpolation of corrections P be second-order [7j §3.3]. 
The analogous AMG theory implies a similar condition on the interpolation accuracy 
of low-energy errors. While a piecewise-constant P is acceptable in a two-level cycle, 
it is insufhcient in V-cycles; W-cycles are faster but more costly [W, p. 471]. 

Constructing a second-order P is a challenge even for grid graphs. Often a first- 
order interpolation is constructed, followed by a smoothing step [16] or a scheme that 
reduces the interpolation's energy [11] The situation is exacerbated in graphs 
with no geometric structure: the interpolation order is undefinable. Even had the 
graph's effective dimension d been known, the required interpolation caliber, i.e., the 
number of coarse nodes used to interpolate a fine node, would grow with d and result 
in unbounded interpolation complexity. Finally, choosing a proper interpolation set 
(whose "convex hull" contains the fine node) is a complex, costly endeavor [11] . 

In contrast, LAMG employs a caliber- 1 (piecewise-constant) P, and corrects the 
energy of the coarse-level Galerkin operator to maintain good convergence (in prac- 
tice, the correction is actually applied to the coarse RHS). This could not have been 
achieved within the variational setting, which only permits modifying P. Whereas 
high-caliber works focus on optimizing P, here the barrier to fast convergence is the 
coarse-to-fine operator energy ratio. Our contribution is an algorithm that yields a 
small energy ratio, which translates into optimal efficiency; cf. §3S!1HSIS1 The Com- 
patible Relaxation (CR) performance predictor [1 El [11], [HI §§14.2-14.3] is not 
relevant for low interpolation accuracy; the energy ratio is a better predictor. 

2.3. Coarse-level Fill-in. Frequently, the coarse-level matrices in AMG hierar- 
chies become increasingly dense. This is a result of a poor aggregation, a high-caliber 
P, or both: many fine nodes whose neighbor sets are disjoint are aggregated, creat- 
ing additional edges among coarse-level aggregates. This renders the ideally-accurate 
interpolation irrelevant, because the actual cycle efficiency (error reduction per unit 
work) is small even though convergence may be rapid. While fill-in is often man- 
ageable in grid graphs because the coarse graphs are still local, it is detrimental in 
non-local graphs. 

LAMG's interpolation is designed to create an insignificant fill-in: the sparser P, 
the sparser the Galerkin operator P^AP. The affinity-based aggregation ( §3.4|) also 
helps, as it tends to aggregate nodes with many common neighbors. The cycle work is 
further controlled by a fractional cycle index [12, §6.2] between 1 and 2; cf. i!i )3.1l3.6l 

Occasionally, the interpolation caliber may be slightly increased as long as the 
number of coarse edges does not become too large; see ^5.'2\ 

2.4. Extenuating Circumstances. Specific properties of the graph Laplacian 
can be exploited to simplify the LAMG construction. 

• GS is an effective smoother in SPS systems [T^l §1]. Also, the quadratic form 
(jl.ip is handy for discerning and solving the energy infiation problem fi i3.5p . 
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• Since A has zero row sums, its null-space eigenvectors are constant in every 
connected component. Under the general AMG assumption that all near- null- 
space errors can be fitted by a single interpolation from a coarse level [121 
p. 8], the caliber-1 interpolation weights are apriori set to 1. (This assumption 
is easily verified for Laplacians with bounded node degrees [50l p. 439]. It is 
violated in wave equations, where multiple coarse grids are required [37].) 

• Some graph locales are effectively one-dimensional: many nodes have degree 
2-3. Such nodes can be quickly eliminated similarly to the paper [35] ( ii3.3p . 

• In other graphs GS is an efficient solver and no coarsening is required. These 
include complete graphs, star graphs and expander graphs ;46, §1]. 

2.5. Data Structures. The efficiency of iterative methods for ()1.3p depends 
on a proper storage format for A, as they compute many matrix-vector products. 
AMG algorithms also require other types of operations, e.g., removing a row and a 
corresponding column upon node aggregation. Standard choices such as Compressed 
Column storage (CCS) [U |5S] may not be optimal for these operations, especially 
in graphs containing suns. In our implementation, for instance, we do not recom- 
pute affinities upon a TV update in Algorithm H] because that would require a slow 
Matlab CCS matrix row update. 

Optimized architectures for such operations that best utilize the available hard- 
ware (e.g., CPU or GPU) may be pursued [26 . 

3. The LAMG Algorithm. Veracious to the algebraic multigrid framework, 
LAMG recursively constructs a hierarchy of L increasingly coarser Laplacian systems 
("levels") A'x' = b\ Z = 1, . . . , L during a setup phase. The finest level is the original 
system, A^ :— A,b^ :— h. The setup phase depends on A only, and constructs a 
sequence {(A',P')}^2; where A' is an n; x ni Laplacian and P' is an ri;_i x ni 
interpolation matrix from level I to I — 1. We also use the notation G' = (7V',f') for 
the graph corresponding to A', although it is not explicitly stored by our code. Each 
level Z > 2 is of type Elimination (obtained from the next-finer level by exact node 
elimination) or Aggregation (a caliber-1 node aggregation of fine-level nodes). The 
vectors {b'}^2 computed during the solve phase. 

3.1. Setup Phase. The setup flow is depicted in Fig. |3?TJ It requires two inputs: 

• Cycle index 7 > 1 to be employed at most levels of subsequent solution cycles. 

• Guard < g < 1, which bounds coarsening ratios and controls the cycle work. 
In our program, 7 = 1.5 and g = .7; these values are discussed in tJ3.6l 

Given the current coarsest Laplacian , we first estimate the speed of relaxation 
fi i3.2p . and terminate if it is fast enough. Otherwise, disconnected and low-degree 
nodes are eliminated from G' ( ij3.3p . If such nodes are found, an Elimination level 
is added to the hierarchy, and G''s disconnected nodes are amended to the list of G's 
connected graph components. Let / be the new coarsest level; next, TV' is aggregated 
to form Af''^^ , P''~^^ ( W3A\i . Finally, an energy-corrected Galerkin operator is 
constructed ( ^3.5p . Coarsening is repeated until ni is small enough or until relaxation 
is fast. Finally, G's connected components are assembled ( tj3.5.5p . 

3.2. Relaxation. Let I be the currently processed level during setup. For sim- 
plicity, we omit the ^-superscripts in this section. A Gauss-Seidel (GS) relaxation 
sweep for Ax = b is defined by the successive updates [T^ §1.1] 

roT u = 1, . . . ,n , Xu < . (3.1) 
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Fig. 3.1. LAMG setup phase flowchart. 



A GS solve iteration is a GS sweep followed by subtracting the mean of x from 
all {xu}u- Starting from xq = random[— 1, 1], we apply z/ = 15 GS solve itera- 
tions to Ax — and estimate the Asymptotic Convergence Factor (ACF) by p := 
||xy||2/||xjx-i||2, where x^ is the i"^ iterant and || • II2 is the I2 norm. If p < .7, I 
becomes the final coarsest level L. A slower GS solve implies that the graph is "stiff" , 
i.e. there exist low-energy errors whose magnitude is not reflected by their residuals 
[m §1.1], hence further coarsening is necessary. 

The speed check is skipped if G has disconnected (0-degree) nodes, since a division 
by would occur in (|3.ip . 



3.3. Low-degree Node Elimination. First, we eliminate from G all discon- 
nected nodes Z, and a set of low-degree nodes J-. ^-elimination is mandatory for GS 
to be properly defined on the remaining graph. The 7^-elimination ventures to reduce 
n while not significantly increasing m, at a small cost. This removes the 1-D part of 
the graph and enhances the efficiency of subsequent Aggregation levels (cf. H3.5^ . 

3.3.1. Choosing J'. T is an independent set of nodes u with degree \ Au\ < 4. 
Eliminating a node connects all its neighbors; hence 7^- nodes with \Au\ < 3 do not 
increase m (Table [3T|) . When \ Au\ — 4:, m might be increased by at most 2. However, 
we assume that this is unlikely to happen for many J- nodes, hence eliminate those 
as well (a costly alternative is to monitor the future change in m and only eliminate 
nodes that do not increase it). Larger \Au\ values result in an impractical fill-in. 

The pseudo-code of choosing is given in Algorithm [TJ Nodes are sequentially 
visited; when an J- node is marked, its neighbors become ineligible for inclusion in J^, 
which guarantees the set's independence. 
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\Au\ 




Elimination 


After 

Elimination 


1 


-1 


• — O 


• 


2 


-1 




• • 


3 







V 


4 


+2 


X 


x; 



Table 3.1 

Types of nodes ud T. For each type, the sub-graph of u (gray circle) and Au (black circles) is 
depicted before and after elimination in the worst fill-in case. Am is the maximum increase in m. 



Algorithm 1 J' = LowDegreeNodes{A) 

1: A/" ^ nodes(A), n -h- size(A), ^ {u e A/" : 1 < < 4} . 

2: For each u eU, visited{u) NotVisited. 

3: for each u gU do 

4: if {visited{u) = NotVisited) then 

5: if G Au ■ visited{v) — FNODE then {u may be ehminated} 

6: visited{u) <— FNODE. 

7: For each v e Au, visited{v) <— NotEliminated. 

8; else {u has an J^-neighbor} 

9; visited{u) <~ NotEliminated. 

10: end if 

11: end if 

12: end for 

13: return {u eU : visited{u) — FNode}. 











A^c 




Acc 




3.3.2. Elimination Equations. Let C :^ M {Z L) J-) denote the rest of the 
nodes, and permute rows and columns so that ()1.3ap becomes 



(3.2) 



Note that Ajrjr is diagonal. Block Gaussian elimination of the Z- and 7^-blocks 
transforms p.2p into 

= (3.3a) 

- A^^ (b^ - A^cxc) (3.3b) 

(Acc - A^cA^^A^c) xc = be - A^cA^^b^ (3.3c) 

(I3.3cl) is the Schur complement system fS^, and (I3.3ap - p.3b|) recover the rest of the 
variables once it is solved. p.3p can be written more succinctly as 

X = Px'^ + Qb (3.4a) 

P :=n(Oz,-A;^cA^^,Ic)^ , Q :=ndiag{0z,A^^,02} (3.4b) 

A'^x'^ = b^ A'^ := P^AP , b'^ := P^b , := xc . (3.4c) 
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II is a permutation matrix such that II^^'x hsts the values of x at -2 nodes, J- nodes 
and C nodes in this order. (I3.4cl) is a smaUer Laplacian system for which another 
ehmination stage is performed: new Z and J- are identified, leading to a still-coarser 
p.4cp . and so forth, until Z — % and T is relatively small (cf. Algorithm [2] in 
principle, one can proceed until J- ^ % and maintain linear time by scanning only 
the neighbors of the previous stage's J-"-set in LowDegreeNodes{), but we haven't 
yet implemented this feature). For q elimination stages, the operators {Pi,Qi}f=i 
are lumped into composite P = PgPq_i • . . . • Pi and Q to avoid storing a Laplacian 
A"^ per stage. For instance, when q is large yet each stage eliminates only few nodes, 
storage would be prohibitively high. 



Algorithm 2 [A^ P, Q, {P,,, QJf^J = Elimination{A) 

1: A'^ ^ A, A/''= ^ nodes(A), ^ |7V=|, Q ^ 0„^, P ^ I„^, q ^ 0. 

2: while {ric > 1) do 

3; Z ^ {u G Af" : \Al\ = 0}, J" ^ LowDegreeNodesiA"). 
4: if ((Z = 0) and < Mric)) then 
5: return A^P,Q,{P„QJf^l. 
6; end if 

7: q^q + l,C ^J\f {ZUT). 

8: n ^ permutation matrix that transforms Af^' to the order (Z, J^, C). 
9: Q, 4- n • diag{Oz , (A^^)~i , 0^}, R ^ (02 , 0^ , Ic). 
10: P, ^ n . {Oz , -(A^c)'^(A^^)-i , Ic f- 

11: Q ^ Q + PQ,R, P ^ P.,P, A'^ ^ P^A^P, Af" ^ nodes(A'=), ^ IM" 

12: end while 



If no nodes are eliminated during EliminationO , I remains the coarsest level, 
otherwise (A^,P) defines the next level, I + 1. Either way, we again denote the new 
coarsest system by Ax = b to be further coarsened in ij §3.4l43?5] 

3.4. Aggregation. Eq. (|1.3aP is equivalent to the quadratic minimization 

X = min Etot (y) , Etot (x) := ■^x'^Ax - x^b . (3.5) 
y 2 

Let X be an approximation to x after several relaxation sweeps. The remaining error 
e := X — X of "special nature": its normalized residuals are much small than its 
magnitude [T^ §1-1] (assuming its mean of every connected component has been 
subtracted). Such errors are called algebraically smooth, or smooth for short, and 
are approximated by an interpolation from a coarse level, e « Pe"^. The variational 
correction scheme \12\ §4.5] finds the optimal correction in the energy norm, namely, 

e^ = argmin Etot (i + Py") ; (3.6) 

y^ 

The normal equations of this minimization are the Galerkin coarsening 

A"e" = b" , A" := P'^AP , b" := P^ (b - Ax) . (3.7) 
Once an approximation e^ to e"^ is computed, the fine-level approximation is corrected: 



X ^ x + Pe'^ . 



(3.8) 
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This algorithm depends only on P. Our particular P has caliber 1, which is equivalent 
to partitioning A/" into ric non-overlapping aggregates {Tu}u&N''=j where Tu is the set 
of e,('s interpolated from ejj [501 App. 9] with unit weights (cf. ij2.4p and M'^ := 
{1, . . . ,nc}. Each aggregate is composed of a seed node and zero or more associate 
nodes (Fig. 13. 2p . For simplicity, the seeds can be thought of as the coarse nodes, 
although ejj is just a degree of freedom that can be interpreted differently. The 
Galerkin operator computation is simplified and involves additions only: 

^uv=Y. E °™ ' U,VeM^. (3.9) 
Since P has unit row sums, is SPS and zero-sum and hence a Laplacian. 




(a) (b) 

Fig. 3.2. (a) An aggregation. Seeds are the semi-filled nodes. The interpolation copies the 
coarse seed value to its associates, (b) The coarse graph. Edge weights are computed by I3.9il . 



Intuitively, nodes should be aggregated together if their values are "close" . The 
next section introduces a node proximity measure. 

3.4.1. Affinity. Insofar as coarsening concerns the space of smooth error vectors 
X, nodes u and v should be aggregated only if x„ and x.y are highly correlated for all 
such X. We generate K Test Vectors (TVs) x'^^, . . . , x'^^ ~ a sample of this space [9l 
§17.2], [3^. Each TV is the result of applying v GS relaxation sweeps to Ax = 0, 
starting from random[— 1, 1]. Let XnxK '■— (x*^^' | • • • | x'^^). 

Since TVs are used to derive a coarsening of a modest coarsening ratio ndn 
(typically ^ to i), they need not be overly smooth nor numerous: v = 2, \s used at 
all levels; K — ^ TVs are employed to form the finest aggregation level, and an extra 
TV is added at each aggregation level. This incurs a small additional cost and seems 
useful, as coarse-level graphs are often increasingly more dense and complex. (The 
asymptotic GS solve vector obtained in §3.21 can be reused as a TV to save work.) 

The affinity Cuv between nodes u and v is the goodness of fit of fitting the linear 
model a^i, « Xu to TV values: 

I , ) I 



{XujXu) {Xy,Xy) 



(X,F) := ^XWyW . (3.10) 



fc=i 



For every u,v, c„„ = 1, < c„„ < 1 and c„„ = c^„. The affinity measures the strength 
of connection: the larger c„^,, the closer u and v. Let < 5 < 1; nodes u and v are 
called S -affinitives if 

Cuv > (5max< max c^s , max Cs^ > . (3.11) 
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This is analogous to the classical AMG definition (Table 12. 1[ top row) and works well 
in practice. Notwithstanding, alternative definitions that account for the closeness of 
Cuv to 1 need to be explored. 

Drawing an analogy to statistics, Xu can be thought of as a random variable^ Cuv 
is the coefficient of determination (R^) of linearly regressing Xu on Xy [53] using the 
TV sample. This interpretation leads to several observations. 

• 1 — Cuv is an alternative definition of the algebraic distance between u and 
V [40] . It is related to geometric distance in grid graphs. For instance, in 
the 1-D discretized Poisson equation with meshsize h = 1/n whose nodes are 
located at {uh}u=i, 1 — c™ oc {{u — v)h)'^ in the limit of /i — >■ 0. 

• Cuv is invariant to scaling Xu and which is vital in the two suns case 
(Fig.lOD)- 

• Cuv is unbiased by the sample size K , because we use the exact means (0) of Xu 
and Xv over all error vectors, rather than the sample means Xu '■— X^feLi ^^"^^ 
and Xv Indeed, the initial TVs are uniformly distributed, so the probabilities 
of starting from x and from — x are equal; since relaxation is a linear process, 
the probability of encountering a relaxed TV x equals the probability of — x, 
thus the mean of x„ over all possible TVs is 0. 

• In principle. Least-squares regression weights should be proportional to each 
measurement's reciprocal variance |23| . The variance of a TV is proportional 
to an appropriate norm of its normalized residuals, so that smoother the TV, 
the larger its weight ^ §17.2], JJj. In our case, all vectors have equal weights 
because they have the same smoothness level, i.e., comparable normalized 
residuals. This saves work and avoids biases that may occur if an improper 
weighting is applied. (For example, no TV should be given a much larger 
weight than all others, in which case it would dominate the regression and 
lead to a nonsensical c„„.) 

The work [ID] also defines the affinity of node m to a node set V. Here, the analogue 
is the coefficient of multiple determination of regressing Xu on {xv}v&v, 



CuV 



{Xu, Xu) 
{Xu, Xu) 



Xu qvXu , 



argmin 
q 



Xu — qvXv 



vev 



(3.12) 



As a byproduct we obtain the interpolation coefficients q from V to m. (|3.12p reduces 
to (j3.10p for V = {v}, where the corresponding regression model is Xu ~ qxy with 
q = {Xu, Xy)/ {Xy, Xv). Thus (|3.10p also works for non-zero row-sum M-matrices, 
e.g., restricted Laplacians [46l §2]. In the Laplacian, q is abandoned in favor of the 
theoretically known interpolation weight 1 (cf. §2.4^ : in non-zero-row sum cases, Puy 
is set to q (see also ^5.2\ [53]) . 

In the Helmholtz equation, Cuv is small for all u,v, indicating that all nodes are 
"distant" and that no single aggregation of the nodes will yield fast AMG convergence 
(indeed, it is known that multiple coarse grids are required |37|). 

3.4.2. Aggregation Rules. Ideal aggregates have strong internal connections 
and weaker external connections. To this end, we will be guided by four rules: 

1. Each node can be associated with one seed. 

2. A seed cannot be associated. 

3. Aggregate stronger affinitives before weaker. 

4. Favor aggregates with small energy ratios (cf. H3.5p . 
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Rules 1 and 2 prevent an associate from being transitively associated with multiple 
seeds. Otherwise, long chains of nodes might be aggregated together, creating ag- 
gregates with weak internal connections and very large energy ratios. Rule 3 favors 
strongly-connected aggregates. Rule 4 has dual purpose: (a) Ultimately, the energy 
ratio determines the AMG asymptotic convergence factor, hence this rule ensures good 
convergence, (b) Affinities are based on local information (relaxed TVs); their quan- 
titative value becomes fuzzier as nodes grow apart |40| §5]. Since small energy ratios 
usually lead to small aggregates, afRnities are indeed used only for local aggregation 
decisions. A typical coarsening ratio in our algorithm ranges between .3-. 5. 

3.4.3. Aggregation Algorithm. The main call aggregate^ gradually aggre- 
gates A/" in g stages, i.e., generates aggregate sets Si,...,Sq such that each 5*^- 
aggregate is contained in some Sj+i-aggregate (Algorithm [3]) . The affinity thresh- 
old 5 is monotonically decreased for stronger connections to be aggregated before 
weaker connections. Among the sets, we select that for which the coarsening ratio 
a :— \Si\/n is closest to Ofmax '■= g/j- This aims at bounding the total cycle work by 
« 1 -|- 7amax + (T^max)^ + ■ ■ ■ {1 — g)~^ fiuest-lcvel units, assuming the same fill-in 
at all levels (a more accurate a definition could be the ratio of coarse to fine edge 
numbers, which is easily tracked during aggregation). 

S is encoded by the status array: status (i) = Undecided := —1 denotes an 
undecided node, status{i) = Seed := a seed node, and status{i) > indicates that 
i is an associate of the seed status{i). High-degree nodes are apriori set to seeds, 
because a sun may only be another node's associate at a coarser level where its degree 
has dropped to 0(1). In each stage, we loop over undecided nodes and decide whether 
to aggregate each one with an existing seed, or with an undecided node that thereby 
becomes a new seed. 



Algorithm 3 5* = agffregQte(A, X, Qmax, /max) 

^ oo, n <- size(A). 
2: nc ^ n, a ^ 1, i ^ 0, (5 ^ .9. 

3; C <— {cuv)u,vi where Cuv ^ (I3.10p with test vectors X, V(m, v) £ £. 
4: For each u gU, status{u) ^ Vndecided, aggregateSize{u) <— 1. 
5; For each u € {u : \Au\ > 8 median({^^}^)}, status(u) ^ 0. 
6: while ((a > ctmax) and (i < /max)) do {Main aggregation loop} 
7: i ^ i + 1, (5 ^ .6(5. 

8: aQpreQationStaaei status . A , Tic, C, X, aggregateSize, 6) 
9: a <r- ric/n, ■<— (1 — a if a < amax, otherwise 1 + a). 

10: [Save current aggregate set] Si <— status. 

11: For each u e {v : St{v) = Seed}, Si{v) <— v. 

12: end while 

13: i ^ argmin(i?). 

14: Return 5,;. 



aggregationStage{) relies on bestSeed{) to locate the closest seed of each node. 
Note: while X and aggregateSize are modified in Algorithm^! neither is subsequently 
used. This anticipates an improved bestSeed{) implementation in the next section that 
does utilize the updated values. 

Parameters in the code are passed by reference when underlined, otherwise passed 
by value. 
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Algorithm 4 aqgreqationStaqei status , A , rir, C. X, aqgreqateSize, 5) 

1: [Strong neighbors] N <— {cuv > <^ max {maxg^u c„s , maxs^^, Cst;}}. 
2: U ^ {m : status{u) = Undecided}. 

3: for eacli m in U n {u : ^„ n N 7^ do {Undecided nodes with strong neighbors} 
4: if status{i) ^ Undecided then {i's status was changed earlier in this sweep} 
5: continue 
6: end if 

7: s -^r- bestSeed{A, X, aggregateSize, N, C, u). 

8: if s 7^ NotFound then {Identified seed neighbor, aggregate u with s} 

9: status{s) <— Seed, status{u) <^ s, ric ^ ric — 1. 
10: [Update TV values on new aggregate] For k — 1, . . . , K, Xuk Xsk- 
11: aggregateSize{u), aggregateSize{s) ^ aggregateSize{s) + 1. 
12: end if 
13: end for 



Algorithm 5 s <— bestSeed{A, X, aggregateSize, N, C, u) 
1: [seed & undecided (5- affinitives] 5 N n {w : status{u) G {Undecided, Seed}}. 

2; if = 0, return NotFound, else [Closest neighbor] return argmax Cuv end if 



3.5. Energy-Corrected Coarsening. 

3.5.1. Energy Inflation. The Galerkin coarse-level correction p.6p e"^ is the 
best approximation to a smooth error e in the energy norm. Braess [B] noted that 
this does not guarantee a good approximation in the I2 norm. For example, if e 
is a piecewise linear function in a path graph (1-D grid with w = \) coarsened by 
aggregates of size two, Pe'^ is constant on each aggregate and matches e's slope across 
aggregates, resulting in about half the fine-level magnitude; cf. Fig. 13.3b . 



Fig. 3.3. The Galerkin correction Pe'^ to a piecewise linear error e in a path graph for (a) 
uniform 1:2 aggregation and (b) variable aggregate size, h > is arbitrary. 

An equivalent but more useful observation is that the energy of PTe is twice 
larger than e's, where Te is a coarse representation of e, say, 

(Te)j, ^ e„ , [/ e (3.13) 
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T is called the aggregate type operator (alternatively, Te could be defined as the 
vector of seed values. Either definition satisfies TP = and falls under the umbrella 
of a corresponding CR theory [T^ §2]). Rewrite p.6|) as 

min |i(yYAV-y'P^r| , r:=Ae. (3.14) 

For an ideal interpolation P that satisfies PTe = e, (|3.14p is minimized by y"^ = Te. 
A caliber-1 interpolation P still satisfies PTe « e, but the first term in p.l4p is 
multiplied by the energy inflation factor 

'<^'-^^^- -<-):^i(e'rAV^ ,3,., 

Now p.l4p is minimized by y'^ w q^^Te. As e is not significantly changed by relax- 
ation, its two-level ACF will be p « 1 — 1/(7. In Fig. 13.3b . g « 2 and p ~ .5. 

3.5.2. Energy Correction. Several energy inflation remedies may be pursued: 

(A) Increase the interpolation's caliber and accuracy. This leads to the fill-in 
troubles discussed in ij2l 

(B) Accept an inferior two-level ACF of 1 — l/g and increase the cycle index 7 to 
maintain it in a multilevel cycle. Unfortunately, not only does this increase 
complexity, the examples of ij33?3ldemonstrate that q can be arbitrarily large. 
The ACF cannot be improved by additional smoothing steps either, because 
it is governed by smooth mode convergence. 

(C) Correct the coarse level operator A*^ to match the fine level operator's energy 
during the setup phase. 

(D) Modify the coarse level correction Pe'^ to match the fine level error e during 
the solve phase. 

In this section we consider option C. The Galerkin equation p.7p is modified to 

P^APe'^ = ^P^(b- Ai) . (3.16) 

The key question is how to choose /i. (Note that if n is constant. Options C and 
D are equivalent.) Motivated by Fig. 13.3b , and its two-dimensional analogue, Braess 
used /i = 1.8, but his V-cycle convergence for grid graphs was mesh-independent 
only if a fixed number of levels were used per cycle and if AMG was used as a FCC 
preconditioner. Moreover, no predetermined global factor exists that fits all error 
corrections in scenarios like Fig. 13.3b . because the coarse-level solution depends on 
local inflation ratios, which vary among graph nodes. 

On the other hand, a local energy correction ^ does exist. Indeed, the quadratic 
energies are separable to nodal energies 



E{x} = ^ Eu{x) , £'„(x) -i ^ auv {xu - x^f , (3.17a) 
i;^(x^)^ 5] £;^(x^), E^ui^^) -.^ J2 ^uv i^h - ^v)\ (3.17b) 
and define the local inflation factor as 



(3.18) 
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In principle, a local can be designed using our TVs to at least partially offset g^j; 
unfortunately, new difficulties arise (cf. tj5.3|) . Consequently, we chose to still scale all 
RHS entries by a flat fi, but modify the aggregation so that qui^) ~ Q for all smooth 
vectors x and all U G TV'^, where Q > 1 is a parameter. Under this condition a global 
factor is effective, whose optimal value minimizes the overall convergence factor: 



/X — argmm max 



1-^ 



argmm <^ |1 — /i| 

M>0 



1-^ 



2Q 
Q + 1 



(3.19) 



In our algorithm we set Q = 2 and /i = |. The expected smooth mode ACF is 

g/(g + i) = |. 

Alternatively, a dynamic (flat or local) fi can be computed for scaling the correc- 
tion Pe'^ during the solve phase; see §3.6.11 



3.5.3. Grid Examples. The local energy inflation varies considerably with ag- 
gregate size, shape and alignment. Consider the unweighted 2-D grid graph whose 
nodes are the 2-D coordinates {(ii, ^2) : ii, 12 £ in the plane ti-t2- Since all afhni- 
ties are equal, the aggregation algorithm of may locally create archetype constellations 
such as Fig. I3.4b ,-d depending on node ordering during the aggregation stages. 



[f- t i [f-^ ^^=^ 4 4 M ir^ ^ 



(a) 



(b) 



(c) 



(d) 



Fig. 3.4. 2-D grid coarsening patterns: (a) 1:2 semi-coarsening (Q = 2). (b) 1:3 semi- 
coarsening (Q = 3). (c) Full coarsening (Q = 2). (d) Staggered semi-coarsening (Q = 3). 



Low-energy vectors x are locally linear in ti and t2, so their nodal values are 
modelled by x{ti,t2) — aiti -I- a2t2. In Fig. 13.4b .. the worst local energy ratio Q at 
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any aggregate, say U = {5, 6}, is 



_B5(ai,a2) = i ^2 (x5 - x„)^ = 2ai + 2a2 , 

■ue{l,4,6,9} 

£'6(ai,c[2) = i ^ (x5 — x„)^ = 2ai + 2a2 , 



2 

u6{2,5,7,10} 

Elr{ai, 02) = i {(•5(x5 + xe) - .5(x3 + X4))^ + (.5(x5 + xe) - .5(x7 + xg))^ + 



2 (.5(x5 + xs) - .5(xi + X2))^ + 2 (.5(x5 + xe) - .5(x9 + xio))^} 
= 4a? + 2al , 

qu{ai,a2) = {4:al+2al) / [2al + 2al) Q = max §(7(01, 02) = ^[/(l, 0) = 2 . 



ai ,a2 



Similar analysis yields Q = 3 in Fig. 13.4b and Q = 2 in Fig. 13.4b . Small energy ratios 
thus occur when aggregates are small (consisting of 2-3 nodes each) and not elongated 
(a size-4 aggregate whose longest "side" is 2 is preferable to chain of size 3). 

We might conjecture that aggregates of size 2 as in Fig. 13.4b , always ensure Q « 2, 
but this turns out to be a fallacy: the staggered coarsening in Fig. 13. 4d manifests 
energy ratios as high as Q = 3. Even worse, its d-dimensional analogue yields Q = 
d + 1 - unbounded as d increases! Such aggregate constellations must be avoided. 
Fortuitously, we already possess the tool to signal and repel them: test vectors. 

3.5.4. Modified Aggregation Algorithm. The decision on aggregating u 
with a seed s during an aggregation stage (Algorithm will now be based on both 
affinities and local TV energy ratios, s is still required to be a (5-affinitive neighbor 
of u. Additionally, for each TV we compare the nodal energy Eu before and after 
aggregation, and aggregate only if the ratio is sufficiently small for all TVs. 

Note that the nodal energy p.l7ap is a quadratic in x„ and {xy}y£y{^ . We define 
the more general functional 



Eu{x;y) := ^a™?/^ -B„(x)y+C„(x) , B„(x) := ^ w^^Xy ,C^{x) ^ ^ 



2 



so Eui'x.) — Eui'x.; Xu)- The energy inflation is estimated by 



^uyXy , 



(3.20) 



■= 1?1<K F(xW-T ) ' ^" ~V ^ ■ ^ ^ ^ 

The numerator of qut is the post- aggregation energy. The denominator is the smallest 
possible local energy of the fine- level TV. This aims at reducing the high-energy modes 
in x^'^') that may obliterate the inflation estimate, and is equivalent to a temporary 
Jacobi relaxation step at u, which is also popular in recent bootstrap AMG works 
pn §2.2]. (A block-relaxation step at u and s could instead be applied to increase 
accuracy at a higher cost, but we have not pursued it in the "lean" spirit of LAMG.) 

hestSeedi) is replaced with a new implementation (Algorithm |6]): {qut\t are 
computed for all 5- affinitive seed neighbors t of m. Among the neighbors for which 
qut < 2.5, u is aggregated with that which represents the smallest aggregate (if the 
minimum is attained by multiple i's, one of them is arbitrarily chosen). If all qut > 2.5, 
u is not aggregated at all. Slightly larger ratios than the target Q = 2 are accepted 
because TVs also contain small-magnitude high-energy modes for which strict ratios 
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Algorithm 6 s <— bestSeed{A, X, aggregateSize, N, C, u) 
1: [seed & undecided (5- affinitives] 5 <— N n {w : status{u) G {Undecided, Seed}}. 

2: if = 0, reiurnNoxFouND; end if 

3: For eacii k^l,...,K, compute A„(x('=)), B„(x('')) using ((3?20)) . 

4: For eacii k = I, . . . ,K , Ft ^ £;„(xW; B„(xW)/a„„) using 

5: For eacii fc = 1, . . . , t e 5, ^ ^ KjxC^'; ^) using (IX^ . qt ^ c['"'^lFk. 

6: [Neighbors with small energy inflation] S {t ^ S : qt> 2.5}. 

7; if (S* = 0), return NotFound, else return argmin aggregateSize{v). end if 



are neither necessary nor attainable. This especially applies at coarse levels, where 
more TVs are used and the chance of increasing the maximum TV energy ratio rises. 

To save work, for each TV x'*^^ we first compute the terms Ba and C„ in p.20p . 
and subsequently evaluate the quadratic £^„(x'-'^-'; ?/) for each y = x'v\v G Au and 
y = jfu^ . Horner's rule [51 p. 8] is applied to save a multiplication: 

Eui^^''>;y) = Qa.,y - B.,, (x'^'))^ y + Cu (x^'^)) . (3.22) 

The complexity of bestSeed{) is therefore at most 0{K\Au\)- 

Two advantages of the low-degree elimination fi )3.3[) are (a) largely preventing 
worst-case 1-D scenarios such as Fig. 13.3b . where it is impossible to obtain low en- 
ergy ratios without excessively increasing the coarsening ratio; and (b) increasing 
the number of neighbors of u and the chance of locating a seed t with small energy 
inflation. 

3.5.5. Connected Component Assembly. Linear-time algorithms for identi- 
fying the connected components of a graph exist [IHl US] , and could be applied prior 
to LAMG invocation to reduce G to a singly-connected graph. This is optional, as 
LAMG naturally represents components as the disconnected nodes of each level. 

We start by computing the coarsest graph G^'s components {■^fm}m=i' ^-S-i using 
Tarjan's algorithm 05]. Let be the L*"^ level interpolation and {T^^^u^M^ be the 
corresponding aggregate set (cf. H3.4p . The set interpolation P^U of a set U'^ C Af^ 
is the support of its characterstic function's interpolant, 

P^W^ ■■= [j Tj^- (3.23) 

An ^^-edge between U and V cannot exist if Tjj and Ty are disconnected. Thus, 
G'~"'^'s connected components are the interpolants P^J\f^, m — 1, . . . , M^, plus any 
disconnected G'~^ nodes. All I — 1 components are interpolated to level I — 2 and 
amended with disconnected G'~^ nodes, and so on. When Z = 1 is reached, we obtain 
the set of connected components of G^ = G. See Fig. 13.51 and Algorithm [71 

3.6. Solve Phase. The solve phase consists of multigrid cycles [TH § 1.4]. Each 
I < L is assigned a cycle index 7' and pre- and post-relaxation sweep numbers I'i,'^!^- 
If ; + 1 is an Elimination level, 7' = 1 and iy[ = v!^ = 0; otherwise, 

y.^f7, l^'l>-l|^^l, ^i^^ ^i^ 2 (3 24) 

mm {2,g\£^+^\/\£^\} , otherwise, ^ ' ^ 
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Fig. 3.5. The three components of G' ({2}, {1,3} and {4}) are interpolated to the G' ^ 
components {3,4}, {1,2,5} and {6,7,8}. G^~'^ 's fourth component is {9}. 



Algorithm 7 N <— connectedComponents{{G'' ,P'^}f"^^) 

1: N ^ connected components of G^, using Tarjan's algorithm [I5] . 

2: for / = L, L - 1 , . . . , 2 do 

3: For each U e N,U ^ P^U. 

4: if I is an Elimination level then 

5; for each elimination stage (cf. i = q,q — 1, . . . ,1 do 

6: Zi ■(— be the set of disconnected nodes this stage. 

7: 7V^7Vu{p[_i-...-P^2:,|. 

8: end for 

9; end if 

10: end for 



At fine levels, 7 = 1.5 is employed; this value is theoretically marginal for a bounded 
multilevel ACF if the smoothest errors' two-level convergence factor is ~ i [T^ §6.2], 
which seems to be implied by (I3.19p . Notwithstanding, worst-case energy ratios occur 
infrequently in practice. This issue is further diminished by the adaptive energy 
correction of t )3.6.1l At coarse levels, 7' is increased to maximize error reduction 
while incurring a bounded work increase. The total cycle work is about 3/(1 — g) 
relaxation sweeps (cf. t j3.4.3p . so g = .7 seems like a reasonable value. 

Three GS sweeps per level provide adequate smoothing, especially in light of the 
coarse-level correction's crudeness. It may be possible to escape with fewer sweeps, 
but the relaxation work is anyway dominated by the setup time, as illustrated in tj4.1l 

Note that we use fixed settings for all cycle parameters: no optimization or fine 
tuning is required for specific graphs. 

If level 2 is an Elimination, b is restricted to b'^ only once, and all cycles are 
applied to A^x^ = b^, followed by interpolating the final to x. 

3.6.1. Adaptive Energy Correction. Instead of fixing /i (cf. one can 

modify the correction to smooth errors during the cycle. Let I be any level such that 
Z -I- 1 is an Aggregation level. Whenever we transition to level I from level ^ — 1, 
1? sub-cycles are applied to A'x' = b', where ?9 is 1 or 2. We save the iterants x^ 
obtained after the pre-relaxation of each sub-cycle. Before switching back to level 



LEAN ALGEBRAIC MULTIGRID (LAMG) FOR THE GRAPH LAPLACIAN 



19 



I — 1, the final iterant x' is replaced with y', where 



y' = x' + ai {x[ - x') + ■ • • + (xj, - x') 



(3.25) 



and {ai}f^i are chosen such that ||b' 



-aVi 



2 is minimized (this is an ni x d least- 



squares problem that is solved in Oini) time). This iterant recombination 'W, §7.8.2] 
diminishes smooth errors x' — x' that were not eliminated by (Z + l)*^-level corrections. 
Since the initial residuals obtained after interpolation from level I + 1 are not smooth, 
a residual minimization is only effective after x' — x' is smoothed. To maximize iterant 
smoothness, more relaxation sweeps are performed after returning from a coarse level, 
hence v\ — l,v\ — 2 m (|3.24p (some pre-smoothing is still needed: this choice was 
superior to v\ = Q^V2 = i in numerical experiments). 

This acceleration is superior to CG because it is performed at multiple levels. 
Iterant recombination at coarse levels was first suggested by A. Brandt and has been 
recognized as an effective multigrid tool [50, Remark 7.8.5]. In LAMG, recombination 
occurs more frequently at coarser levels (since 7 > 1), where it is less expensive. 
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Fig. 3.6. A four-level cycle. A squared number denotes a number of GS sweeps; E denotes an 
exact solver (^3^^^; Z denotes the Grahm-Schmidt loop i3.2(H) : down-arrows denote RHS restric- 
tions | [3.4c[ ) or I3.16\l : up-arrows denote corrections i3,3a\l - i3. 361) or Ii3.8\ ). Iterants are saved at 
the black dots before the restriction; denotes a (i5 + l)-iterant recombination ^3.25}} . 



3.6.2. Zero-Mode Orthogonalization. The null-space components of x are 
not determined by the cycle. At the end of each cycle, we add the Grahm-Schmidt 
procedure 

T 

Form = l,...,M, x ^ x + ~ """" u„ . (3.26) 

Note that the zero modes Ui , . . . , um are known once the connected components are 
assembled. (Alternatively, the cycle could be reformulated using the full approxima- 
tion scheme. The constraints (jl.3bp would be transferred to the coarsest level and 
solved concurrently with (|1.3ap [HJ §5.6].) 

3.6.3. Coarsest Solver. If relaxation converges fast at a certain level, the linear 
system is iteratively solved to sufficient accuracy by GS solve iterations fi i3.2p . If it is 
slow, the coarsest graph should be small enough for a direct solver to be fast. This 
size is implementation-specific; in our program, coarsening terminates when < 150 
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and the augmented system 




(3.27) 



is directly solved. is a matrix whose columns are the characteristic functions of 
the connected components Ni , . . . , A/j^t of . The dual vector is discarded. 

4. Numerical Results. Next, we provide supporting evidence for LAMG's 
practical efficiency for a wide range of graphs. 

4.1. Smorgasbord. An unoptimized LAMG object oriented Matlab 7.10.0 
(R2010a) serial implementation was developed and tested on a diverse set of 1666 real- 
world graphs with up to six million edges, collected from the following sources: 

• The University of Florida Sparse Matrix collection (UF) [20] . 

• C. Walshaw's graph partitioning archive [5T| . 

• I. Safro's MLogA results archive at Argonne National Laboratory [32] ■ 

• The FTP site of the DIMACS Implementation Challenges [5T]. 

The graphs originated from a plethora of applications: CFD airplane and car FEM 
meshes; RF electrical circuits; combinatorial optimization; model reduction bench- 
marks; social networks; web page networks; and many others. Experiments were 
performed on a 64-bit Windows 7 DeU Inspiron 580 (3.2 GHz CPU; 8GB RAM). 

For each graph, a compatible RHS was generated by identifying a pair of nodes 
s, i in the same connected component and setting 6^ = 1, 6t = — 1 and hu = 0, 
u ^ {s,t] (Ax = b models an effective capacitance problem §2]). LAMG setup 
was executed with two aggregation stages at each level. That is, (5 = .9 was set in 
the first stage of Algorithm [31 and if a second stage was performed, it used 5 — .54. 
The solve phase was then invoked twice, with a flat M — f energy correction and an 
adaptive correction; each solve started from a initial random guess and proceeded till 
the residual Z2-norm was reduced by 10^. Five performance measures were computed 
for each graph: 

• Setup time per edge tgotup [seconds]. 

• Solve time per edge per significant figure tgoivc [seconds], for the adaptive 
scheme. If residual norm after i iterations was and p iterations were exe- 
cuted, tsoive := ^/("^logio(^o/''p))i where t was the total solve time. 

• Total time per edge itotai = ^sotup + lOisoivo [seconds]. 

• Asymptotic Convergence Factor (ACF) of the flat and adaptive schemes, es- 
timated as (rp/ro)^/^. 

• Adaptive correction gain A: the ratio of flat-to-adaptive solve times. 
LAMG scaled linearly with graph size: both igotup and tgoivc were approximately 

constant, and the total time per edge was 2.9 • 10"^ seconds on average. Stated 
differently, LAMG performed a single linear solve to 10 significant figures at 33, 000 
edges per second. See Fig. I4.1b .c.d and Table |44] 

Adaptive energy correction provided a 15% — 20% speed up and was superior for 
almost aU graphs (cf. Fig. 14.1b '). The total time comprised 70% setup and 30% solve 
for the flat scheme and 75%/25% for the adaptive scheme. The respective average 
ACFs were .18 and .048. These results were better than expected, possibly because 
of the specific nature of graphs in the collection. We therefore also include details on 
the typically harder graphs for LAMG in Table 14.21 
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# Edges # Edges 



(c) (d) 

Fig. 4.1. LAMG performance vs. number of edges for graphs with 5,000 edges and larger: 
(a) Total time per edge ttotal- (b) Adaptive correction solve time gain A. The dashed line is the 
break-even point A = I. (c) normalized setup time tsetup. (d) Normalized solve time tsolve- 



]\It!a.surc' 


Rkxliaii 


Rk-au ± Sid. Dev. 


Total time [sec] 


2.4 ■ 10"-' 


2.9 ■ IQ--'' ± 2.5 ■ 10--'' 


Setup time [sec] 


1.8 ■ 10--'' 


2.1 ■ IQ-" ± 1.5 ■ 10--' 


Solve time per figure [sec] 


Flat M = 1 


6.2 ■ lO-'^ 


1.3 ■ 10-** ± 1.7- 10-f' 


Adaptive 


5.3 • 10"' 


8.7- 10"' ± 1.0 • 10"° 


ACF 


Flat M = 1 


.180 


.239 ± .209 


Adaptive 


.048 


.069 ± .068 


Adaptive gain A 


1.170 


1.238 ± .302 



Table 4.1 

Median and mean LAMG performance measures over all graphs with 5, 000 or more edges. 
ACF statistics only include graphs for which more than one level was constructed. 



Name 


n / m 


M 


L 


ACF 


^total 


itotal 












Flat 


Adaptive 








Stanford website 


668925/3732100 


5011 


17 


.381 


.114 


85.9% 


1.9 ■ 10" 


5 


Calif, road network 


1965206/2766607 


2638 


19 


.423 


.198 


81.0% 


3.5 ■ 10- 




GaAsHe molecule 


61349/1660230 


1 


9 


.144 


.053 


78.9% 


1.6 ■ 10" 


5 


Citeseer citations 


268495/1156647 


1 


13 


.519 


.175 


83.5% 


2.7- 10" 


5 


Amazon sales 


400727/1049624 


1 


14 


.449 


.172 


81.2% 


3.2 • 10" 


5 


RF circuit device 


74044/207323 


1 


14 


.211 


.102 


73.4% 


7.8 • 10- 


5 


Optimization problem 


8364/105723 


1 


11 


.173 


.056 


72.3% 


2.2 • 10" 


5 



Table 4.2 



LAMG performance for several graphs representative of the harder cases for LAMG. 
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4.2. Example of a Hierarchy. For the UF graph AG-Monien/airf oill-dual 

- a 2-D air foil finite element singly-connected graph, an eight-level LAMG hierarchy 
was constructed (see Fig. 14.21 and Tabl d4.3) ). The total number of edges at all levels 
(measuring the required storage) was three times the finest graph size. Cycle ACFs 
were .358 and .157 for the flat and adaptive energy corrections, respectively. 




Fig. 4.2. The four finest non- elimination level graphs for the 2-D airfoil problem. 



I 


Type 


n / m 


Mean Degree 


u[ + u'2 


7' 


K 


1 


Finest 


8034/11813 


2.94 





1.0 





2 


Elimination 


3858/11279 


5.84 


3 


1.5 


8 


3 


Aggregation 


1561/4523 


5.80 





1.0 





4 


Elimination 


1323/4059 


6.14 


3 


1.5 


9 


5 


Aggregation 


615/1754 


5.70 





1.0 





6 


Elimination 


472/1451 


6.14 


3 


1.5 


10 


7 


Aggregation 


198/530 


5.36 





1.0 





8 


Elimination 


114/339 


5.94 


3 


2.0 


11 



Table 4.3 

LAMG level hierarchy for the 2-D airfoil problem. Graphs were drawn using GraphViz 2.28.0 
with the SFDP algorithm 



4.3. Grids with Negative Weights. Like Bootstrap AMG [HI HO], LAMG 

is not restricted to M-matrices and also applies to some graphs with negative edge 
weights Wuv , as long as the Laplacian matrix is (or very close to being) positive semi- 
definite. This distinguishes LAMG from graph-theoretic works [171 and classical 
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AMG. To demonstrate this capability, we tested LAMG on the following 2-D grid 
Laplacians, whose stencils are depicted in Fig. [ 



(a) The standard 5-point finite-difference (FD) discretization of Uxx + Uyy on the 
unit square with Neumann boundary conditions (BC). 

(b) The 13-point 4*''-order FD stencil of U^x + Uyy. 

(c) The discretized anisotropic-rotated Laplace operator 

(cos^(a) -I- esin^(Q!)) Uxx + (1 ^ S:U\{2o)Uxy + (ecos^(a) -I- sin^(a)) Uyy , 

(4.1) 

with a = — 7r/4, e = 10 standard 5-point stencil of Uxx,Uyy, and an 
alignment-agnostic cross-term 



Ux 



1 

4/^2 



-1 



1 



where h is the grid meshsize. Neumann BC were enforced, 
(d) The same as (c) , but aligning the cross-term with the northeast and southwest 
neighbors: 



U^ 



xy 



1 

2h^ 



-1 1 
2 -1 
-1 



(e) The Finite-Element (FE) discretization of the Laplace operator with stretched 
quadrilateral elements {hx/hy — oo) with periodic boundary conditions. 

(f) The 13-point FD discretization of the biharmonic operator A^C/ with the 
boundary conditions dU/dn — d^U/d-n? = 0. 



.24998 
.50005 
.24998 



(a) 
-.50005 
2.0002 
-.50005 

(c) 



.24998 
-.50005 
-.24998 



-1 


-4 


-1 


2 


8 




-1 


-4 


-1 




(e) 





-1 




-16 




1 -16 60 


-16 1 


-16 




1 




(b) 




-1 


.49995 


-1 3.0001 


-1 


.49995 -1 




(d) 




1 




2 -8 


2 


1 -8 20 


-8 1 


2 -8 


2 


1 




(f) 





Fig. 4.3. Stencils of grid Laplacians with negative weights. Entries are normalized to a 
meshsize-independent sum and rounded to five significant figures. 

Problems (c) and (d) are bad discretizations that do not align with the characteristic 
direction of (|4.1I) . (c),(d) and (e) are considered hard for AMG |10) . 

LAMG exhibited mesh-independent convergence and total time in all cases and 
scaled linearly with grid size. Performance figures are given in Table IT4l 

While the paper [TU] focused on accurately finding the characteristic directions of 
(c)-(e) without sparing setup costs and only presented two-level experiments, LAMG 
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Problem 


m 


L 


ACF 


^total 


*total 


Flat 


Adaptive 


(a) 5-point 


523264 


19 


0.279 


0.136 


71.7% 


4.9 X lO-^' 


(b) 13-point 4*'' order 


1567746 


17 


0.358 


0.196 


71.5% 


7.4 X 10-^ 


(c) Anis. rot., agnostic 


1045506 


16 


0.763 


0.713 


45.6% 


8.2 X 10"^ 


(d) Anis. rot., misaligned 


784385 


17 


0.763 


0.680 


49.5% 


1.3 X 10-* 


(e) Stretched EE 


1567746 


17 


0.807 


0.725 


39.4% 


9.3 X 10-^ 


(f) Biharmonic 


1048576 


14 


0.789 


0.731 


48.3% 


7.0 X 10"^ 



Table 4.4 

LAMG performance for grid graphs on a 512 X 512 grid with n = 262144 nodes. 



is a full multi-level method with a far shorter setup time whose ACF is reasonable in 
all cases, albeit this ACF can also be significantly reduced using bootstrap tools. 

The biharmonic case is intriguing. Even though GS is not the best smoother [121 
§20.3.2] and the interpolation must be second-order [7, §3.3.5], the caliber- 1-based, 
black-box LAMG performed reasonably well without any tuning. 

These results are certainly only preliminary; further research should be directed 
toward improving the convergence rates in cases (c)-(f). 

4.4. Lean Geometric Multigrid. Higher performance for the Poisson equation 
on uniform grid can be obtained by a standard 1:2 coarsening in every dimension at all 
levels and employing GS Red-Black (RB) relaxation [T^l §3.6]. LAMG then reduces 
to Lean Geometric Multigrid (LMG): standard multigrid \12\ §1] cycle with index 
7 = 1.5, first-order transfers and energy-corrected coarsening. Since the energy ratio 
is 2 for all error modes, we employ a fiat correction /i = 2 in Eq. p.l6p . 

As a basic experiment, we compared LMG's performance to the "classical" cycle 
with GS-RB, linear interpolation and second-order full weighting in two-level Local 
Mode Analysis (LMA) and multilevel experiments with seven levels on a 128 x 128 
grid for the 2-D periodic Poisson problem. The LMG (l,2)-cycle turns out to be a 
record-breaking Poisson solver in terms of asymptotic efficiency at a .5 convergence 
per unit work, versus the classical V(l,l) at .67 (cf. Table |475)) . 



Experiment 


Method 




(1,0) 


(1,1) 


(1,2) 


(2, 2) 


Two-level LMA 


LMG 


.500(1.5) 


.125(2.5) 


.034 (3.5) 


.024 (4.5) 


Classical 


.250(3.3) 


.074 (4.3) 


.051 (5.3) 


.039(6.3) 


Multilevel ACF 


LMG (7 = 1.5) 


.470(2.1) 


.119(3.6) 


.032(5.0) 


.022(6.4) 


Classical V-cycle 


.261 (4.4) 


.101 (5.7) 


.060(7.1) 


.045(8.4) 



Table 4.5 

LMA predictions and actual asymptotic convergence of LMG vs. classical multigrid for the 2-D 
periodic Poisson problem. The ACF and cycle work in relaxation units (in parentheses) are listed 
for each case. 

This caliber- 1 "super-efficiency" for the periodic Poisson problem can be explained 
by interpreting the grids of the hierarchy as cell-centered discretizations at meshsize 
h, 2h, Ah, . . . , for which the LMG restriction operator is effectively second-order when 
fi — 2. Although the interpolation is still first-order, LMG still satisfies the transfer 
operator rules for optimal multigrid efficiency |121 §4.3]. 

Other boundary conditions impede the choice of fi. While supplementary local 
relaxations near boundaries theoretically ensure attaining the two-level rates, it would 
be more beneficial to study the performance of adaptive energy correction in LMG. 
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5. Extensions. Enhancements and adaptations of the LAMG approach to re- 
lated computational problems are outlined. 

5.1. High Performance Implementation. The main bottleneck of the cur- 
rent Matlab implementation is the aggregation, specifically, locating the neighbors 
Au of node u and the corresponding graph weights, needed for the various term com- 
putations in Algorithm [HI This is because the aggregation stage sequentially scans 
and updates nodes while Matlab excels at vector operations. An informal survey 
of language operation benchmarks seems to indicate that a C or C-l — h implementa- 
tion could provide a 2-4 speed-up. The mex interface can be used to program these 
internal loops in C/C-I--I- while continuing to use Matlab for the entire program [55] . 

AMG parallelization to multiple processors is nontrivial. Lessons from other 
parallel AMG works [H |3j will likely directly apply to LAMG. Regarding LAMG- 
specific operations, TVs and affinities can be computed in parallel; parallelizing the 
aggregation stage will again pose the greatest challenge. Aggregation decisions may 
need to be modified to be symmetric to avoid conflicts near processor boundaries. An 
important advantage is LAMG's tendency to reduce the size of the coarse stencils, 
which should help reduce the overlap between the sub-graphs assigned to neighboring 
processors, as well as communication at the coarsest levels. 

We plan to compare LAMG's performance within existing AMG infrastructures 
such as Hypre and Trilinos-PETSc. Like classical AMG solvers, LAMG can be incor- 
porated as either a stand-alone solver or as a CG preconditioner. 

5.2. Coarsening Improvements. The final word has by no means been said 
on the details of the LAMG algorithm of ij3l 

• Currently, u can only be aggregated with a directly- neighboring seed s. In 
some problems, one should also search within u's A^-neighbors to construct a 
good aggregation. For instance, in the anisotropic-rotated problem Fig. I4.3t i. 
u should be aggregated along the characteristic direction, i.e., with its south- 
east (or northwest) neighbor, neither of which is contained in A. 

• During the aggregation stage (Algorithm |4]), a clever ordering of the unde- 
cided nodes can winnow out "holes" , i.e., lone nodes that cannot be associated 
just because all their neighbors have been aggregated, and no aggregate en- 
largement can be warranted by the energy ratio control. 

• If no small energy ratio can be found, or if subsequent cycle convergence is 
slow, isolated bottleneck nodes may be de-aggregated. 

• Bigotry of the "lean" approach should not be practiced, either: one can 
occasionally up the interpolation caliber at these troublesome nodes, provided 
that this does not substantially increase the total coarse edges. 

• Since the coarsening algorithm utilizes both affinity and energy ratio condi- 
tions, it may be possible to reduce the number of TVs K and/or the number 
of relaxation passes ly (or trade one for the other to increase efBciency at a 
given complexity). 

5.3. Local Energy Corrections. Instead of a flat /i = | factor in p.l6p . one 

can apply different ^'s to different aggregates. We experimented with different energy 
correction schemes, some based on fltting the coarse nodal energies of TVs to their 
flne counterparts (Eq. p.l7bp ). While this can dramatically curtail energy inflation, 
care must be taken to avert over-fitting that ultimately results in the coarse-level 
correction operator's instability. 

Analogously, one can define a local adaptive fi in the MINRES procedure of 
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ti3.6.1l at each level I. For example, /i may be constant on the nodes of each level / + 1 
aggregate. ^ should be smoothed by (say) a GS relaxation sweep on A'+^/x = 0, as 
it multiplies a smooth correction vector and cannot be allowed to radically oscillate. 

It is unclear whether the extra work would be justified in either case; it may be 
useful for problems re-solved for many RHS vectors or when a larger setup overhead 
is tolerable. 

5.4. The Eigenvalue Problem. AMG can be nicely combined with the Exact 
Interpolation Scheme (EIS) 34 to find the smallest nonzero eigenvalue and associated 
Fiedler vector. After the LAMG setup phase, multilevel EIS cycles can be applied 
to Ax = Ax using the sparsity pattern of P and A at all levels. After the current 
approximation x is relaxed at level I, the interpolation weights are re-set to pl^^ = 
Xu/xy (and possibly modified back to 1 near i's zeros), followed by recomputing 
j^i+i ^ (p')Ta'p' as well as coarsening the mass matrix B'+^ = (P')^B'p'; := I. 
At the coarsest level, the lowest eigenpair of the pencil (A'^,B'^) is calculated. The 
same 7, i^i, can be used as in the linear solver. 

The reasoning behind this algorithm is that a piecewise-constant interpolation 
fits all near-null-space eigenvectors of A, not just the constant vector. Hence, large 
affinities indicate large correlations among the nodal values of the Fiedler vector, so 
the coarsening pattern is adequate for its computation. EIS is particularly attractive 
thanks its super-linear convergence. 

Alternatively, one can incorporate the LAMG linear solver into a Rayleigh Quo- 
tient iteration, which converges locally cubically to the smallest eigenvalue [29l §8.2]. 

We plan on developing the LAMG eigensolver (including its generalization to find- 
ing several lowest eigenpairs) in the near future, because the Laplacian eigenproblem 
is even more ubiquitous in applications than the linear system. 

5.5. Other Linear Systems. The caliber-1 algorithm can be applied to non- 
zero row sum matrices, except that the interpolation weights are no longer 1. The 
affinity definition p.lO|) remains intact, and the corresponding P entry is set to 

Puv := argmin ||A„ - qX^Wl^ (5.1) 

(cf. p.l2p . Normally, relaxed TVs yield an accurate enough in problems with 
almost-zero modes, e.g., the QCD gauge Laplacian, TVs may need to be improved by 
a bootstrap cycle [TT] . 

Further research should be conducted for negative- weight graphs such as the high- 
order finite element and anisotropic grid graphs of ij4.3l The reported convergence 
factors can be reduced by producing bootstrapped TVs via applying multilevel cycles 
to Ax = 0. The cycle is far more powerful than plain relaxation in damping smooth 
characteristic components, which should lead to more meaningful algebraic distances 
and to the correct anisotropic coarsening in a second setup round (much larger spacing 
in the characteristic direction and no coarsening in the cross-characteristic direction). 
The bootstrap procedure should be similarly useful for many other graphs. 

6. Conclusion. Over the last decade, Laplacian matrices have attracted increas- 
ing attention because they underlie a plethora of graph computational applications 
ranging from genetic data clustering to social networks to fluid dynamics. To the 
best of our knowledge, the presented algorithm. Lean Algebraic Multigrid (LAMG), 
is the first graph Laplacian linear solver demonstrated to scale linearly with graph 
size. LAMG will hopefully pave the way to significant speed-ups in those applications. 
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